This exercise uses a small subset of the data from Kaggle's Yelp Business Rating Prediction competition. It is stored in the local yelp.csv file.
Description of the data:
yelp.csv contains the dataset. It is stored in the repository (in the data directory), so there is no need to download anything from the Kaggle website.
Goal: Predict the star rating of a review using only the review text.
First, we read yelp.csv into a pandas DataFrame and examine it.
In [1]:
import pandas as pd
path = 'material/yelp.csv'
yelp = pd.read_csv(path)
In [ ]:
# examine the shape
yelp.shape
In [ ]:
# examine the first row
yelp.head(1)
In [ ]:
# examine the last three rows
yelp.tail(3)
In [ ]:
# select only the reviews with 5 stars
yelp[yelp['stars'] == 5]
In [ ]:
# All columns
yelp.columns
In [ ]:
# The first sample
yelp.iloc[0]
In [ ]:
# examine the class distribution
yelp.stars.value_counts().sort_index()
Create a new DataFrame that only contains the 5-star and 1-star reviews.
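One possible way to do the filtering is sketched below. So that it runs without the CSV file, it uses a tiny made-up DataFrame (toy) in place of yelp; the rows and the name toy_best_worst are hypothetical and only illustrate the technique. It assumes the ratings are stored as integers in the stars column, as the examples above suggest.

```python
import pandas as pd

# Hypothetical stand-in for the yelp DataFrame (made-up rows, illustration only)
toy = pd.DataFrame({
    'stars': [5, 1, 3, 4, 5, 1],
    'text': ['great food', 'terrible service', 'it was ok',
             'pretty good', 'loved it', 'never again'],
})

# Keep only the rows whose rating is 1 or 5
toy_best_worst = toy[toy['stars'].isin([1, 5])]

# Check that only the two classes remain
toy_best_worst.stars.value_counts().sort_index()
```

The same boolean mask could be applied to yelp directly. An equivalent form combines two comparisons with the | operator, for example (yelp.stars == 5) | (yelp.stars == 1).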
In [ ]:
In [ ]:
In [ ]:
In [ ]:
Calculate which 10 tokens are the most predictive of 5-star reviews, and which 10 tokens are the most predictive of 1-star reviews.
Hint: use the feature_count_ and class_count_ attributes of the Naive Bayes model object.
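The following is a minimal sketch of one way to use these attributes, not necessarily the intended solution. It assumes scikit-learn's CountVectorizer and MultinomialNB, and it uses four made-up reviews instead of the real data so that it runs on its own. The add-one smoothing and the ratio column are choices made for this illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy reviews standing in for the 1-star and 5-star subset
texts = ['great food great service', 'great place',
         'terrible food', 'terrible rude service']
stars = [5, 5, 1, 1]

# Build the document-term matrix and fit Naive Bayes
vect = CountVectorizer()
X = vect.fit_transform(texts)
nb = MultinomialNB()
nb.fit(X, stars)

# Tokens ordered by their column index in the document-term matrix
tokens = sorted(vect.vocabulary_, key=vect.vocabulary_.get)

# nb.classes_ is sorted, so row 0 holds 1-star counts and row 1 holds 5-star counts
counts = pd.DataFrame({
    'one_star': nb.feature_count_[0],
    'five_star': nb.feature_count_[1],
}, index=tokens)

# Add 1 to every count to avoid dividing by zero (a choice made for this sketch),
# then divide by the number of reviews in each class
counts['one_star'] = (counts['one_star'] + 1) / nb.class_count_[0]
counts['five_star'] = (counts['five_star'] + 1) / nb.class_count_[1]

# Ratio above 1: token leans towards 5-star reviews; below 1: towards 1-star
counts['five_star_ratio'] = counts['five_star'] / counts['one_star']

top_five_star = counts.sort_values('five_star_ratio', ascending=False).head(10)
top_one_star = counts.sort_values('five_star_ratio').head(10)
```

Dividing by class_count_ matters when the two classes have different numbers of reviews, because otherwise tokens from the larger class would look more predictive simply because that class has more text.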
In [ ]: